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Abstract 

In this article, we investigate the use of a probabihstic model for unsupervised clus- 
tering in text collections. Unsupervised clustering has become a basic module for many 
intelligent text processing applications, such as information retrieval, text classification 
or information extraction. 

Recent proposals have been made of probabilistic clustering models, which build "soft" 
theme-document associations. These models allow to compute, for each document, a 
probability vector whose values can be interpreted as the strength of the association 
between documents and clusters. As such, these vectors can also serve to project texts into 
a lower-dimensional "semantic" space. These models however pose non-trivial estimation 
problems, which are aggravated by the very high dimensionality of the parameter space. 

The model considered in this contribution consists of a mixture of multinomial distri- 
butions over the word counts, each component corresponding to a different theme. We 
present and contrast various estimation procedures, which apply both in supervised and 
unsupervised contexts. In supervised learning, this work suggests a criterion for evaluat- 
ing the posterior odds of new documents which is more statistically sound than the "naive 
Bayes" approach. In an unsupervised context, we propose measures to set up a systematic 
evaluation framework and start with examining the Expectation-Maximization (EM) al- 
gorithm as the basic tool for inference. We discuss the importance of initialization and the 
influence of other features such as the smoothing strategy or the size of the vocabulary, 
thereby illustrating the difficulties incurred by the high dimensionality of the parameter 
space. We also propose a heuristic algorithm based on iterative EM with vocabulary re- 
duction to solve this problem. Using the fact that the latent variables can be analytically 
integrated out, we finally show that Gibbs sampling algorithm is tractable and compares 
favorably to the basic expectation maximization approach. 

Keywords: Multinomial Mixture Model, Expectation-Maximization, Gibbs Sampling, 
Text Clustering 



Resume 

Dans cet article, nous presentons une etude detaillee d'un modele probabiliste simple, le 
melange de multinomiales, dans un contexte de classification non-supervisee de collections 
de textes. 

La construction de groupes de documents thematiquement homogenes est une des 
technologies de base de la fouille de texte, et trouve de multiples applications, aussi bien 
en recherche documentaire qu'en categorisation de documents, ou encore pour le suivi 

*This work has been supported by France Telecom, Division R&D, under contract n°42541441. 
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de themes et la construction de resumes. Diverses propositions recentes out ete faites de 
modeles probabilistes permettant de construire de tels regroupements. Les modeles de 
classification probabiliste ont I'avantage de pouvoir egalement etre vus comme des outils 
permettant de construire des representations numeriques synthetiques des informations 
contenues dans le document. Ces modeles posent toutefois des problemes d'estimation 
difficiles, qui sont diis en particulier a la tres grande dimensionalite du vocabulaire. 

Notre contribution a cette famille de travaux est double : nous presentons d'une 
part plusieurs algorithmes d'inference, certains originaux, pour I'estimation du modele 
de melange de multinomiales ; nous presentons egalement une etude systematiquc des 
performances de ces algorithmes, fournissant ainsi de nouveaux outils methodologiques 
pour mesurer les performances des outils de classification non supervisee. 



1 Introduction 

The wide availability of huge collections of text documents (news corpora, e-mails, web pages, 
scientific articles...) has fostered the need for efficient text mining tools. Information retrieval, 
text filtering and classification, and information extraction technologies are rapidly becom- 
ing key components of modern information processing systems, helping end-users to select, 
visualize and shape their informational environment. 

Information retrieval technologies seek to rank documents according to their relevance 
with respect to users queries, or more generally to users informational needs. Filtering and 
routing technologies have the potential to automatically dispatch documents to the appropri- 
ate reader, to arrange incoming documents in the proper folder or directory, possibly rejecting 
undesirable entries. Information extraction technologies, including automatic summarization 
techniques, have the additional potential to reduce the burden of a full reading of texts or mes- 
sages. Most of these applications take advantage of (unsupervised) clustering techniques of 
documents or of document fragments: the unsupervised structuring of documents collections 
can for instance facilitate its indexing or search; clustering a set of documents in response 
to a user query can greatly ease its visualization; con sidering sub-classes ind uced in a non- 
supervised fashion can also improve text classification Jvinot and Yvoni B , etc. Tools for 



building thematically coherent sets of documents are thus emerging as a basic technological 
block of an increasing number of text processing applications. 

Text clustering tools are easily conceived if one adopts, as is commonly done, a bag-of-word 
representation of documents: under this view, each text is represented as a high-dimensional 
vector which merely stores the counts of each word in the document, or a transform thereof. 
Once documents are turned into such kind of numerical representation, a large number of 
clustering techniques become available ( Jain et al. . 19991 ) which allow to group documents 



based on "semantic" or "thematic" similarity. For text clustering tasks, a number of proposal 
have recently been made whic h aim at identi fying pr obabilistic ("soft") theme-document 



associations (see, eg.. iHofmanrk 2001 : tBlei et al. . 2002.: iBuntine and .Takulin . 2QQ4 ). These 



probabilistic clustering techniques compute, for each document, a probability vector whose 
values can be interpreted as the strength of the association between documents and clusters. 
As such, these vectors can also serve to project texts into a lower-dimensional space, whose 
dimension is the number of clusters. These probabilistic approaches are certainly appealing, 
as the projections they build have a clear, probabilistic interpretation; this is in sharp contrast 
with alter native projection techniq ues for text documents, such as Latent Semantic Analy- 
(LSA) dPeerwester et al.l . ll99nh or non-negative matrix factorization (NMF) techniques 
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( VinokoTirovl . 120021 : IShahnaz et ID . hood ). 



In this paper, we focus on a simpler probabilistic model, in which the corpus is repre- 
sented by a r nixture of mu ltinomial distributions, each component corresponding to a differ- 
ent "theme" ( Nigam et all 12000 1. This model is the un supervi sed counterpart of the popular 



"Naiv e Bayes" model for text classification (see, eg., LewieJ . [l998; McCallum and Niganjj, 



1998h . Our main objective is to analyze the estimation procedures that can be used to infer 



the model parameters, and to understand precisely the behavior of these estimation proce- 
dures when faced with high-dimensional parameter spaces. This situation is typical of the 
bag-of-word model of text documents but may certainly occur in other contexts (bioinformat- 
ics, image processing. . . ). Our contribution is thus twofold: 

• we present a comprehensive review of the model and of the estimation procedures that 
are associated with this model, and introduce novel variants thereof, which seem to 
yield better estimates for high-dimensional models, and report a detailed experimental 
analysis of their performance. 

• these analyses are supported by a methodological contribution on the delicate, and 
often overlooked, i ssue of performance evaluation of clustering algorithms (see, eg., 
Halkidi et al.l . EoOlh . roposal here is to focus on a "pure" clustering tasks, where 



the number of themes (the number of dimensions in the "semantic" space) is limited, 
which allows in our case a direct comparison with a reference (manual) clustering. 

This article is organized as follows. We firstly introduce the model and notations used 
throughout the paper. Dirichlet priors are set on the parameters and we may use the 
Expectation-Maximization (EM) algorithm to obtain Maximum A Posteriori (MAP) esti- 
mates of the parameters. An alternative inference strategy uses simulation techniques (Monte- 
Carlo Markov Chains) and consists in identifying conditional distributions from which to 
generate samples. We show, in Section [2.31 that it is possible to marginalize analytically all 
continuous parameters (thematic probabilities and theme-specific word probabilities). This 
result generalizes an o bservation that was used, in the context of the Latent Dirichlet Alloca- 
tion (LDA) model by ( Griffiths and StevversL 20021 ) . We first examine what the consequences 



of this derivation are for supervised classification tasks. We then describe our evaluation 
framework and highlight, in a first round of experiments, the importance of the initialization 
step in the EM algorithm. Looking for ways to overstep the limitations of EM by incremental 
learning, we present an algorithm based on a progressive inclusion of the vocabulary. We 
eventually discuss the application of Gibbs sampling to this model, reporting experiments 
which support the claim that, in our context, the sampling based approach is more robust 
than EM alternatives. 



2 Basics 

Li this section, we present our model of the count vectors. Since we assume that the dis- 
tribution of the words in the document depends on the value of a latent variable associated 
with each text, the theme, we use a multinomial mixture model with Dirichlet priors on the 
parameters. 

We show how this model is related to the naive Bayes classifier and then explain that some 
conditional densities follow another distribution, called "Dirichlet-Multinomial" and how this 
fact proves useful for both classification and unsupervised learning. 
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2.1 Multinomial Mixture Model 

We denote by ud, nw and ut, respectively, the number of documents, the size of the vocab- 
ulary and the number of themes, that is, the number of components in the mixture model. 
Since we use a bag-of-words representation of documents, the corpus is fully determined by 
the count matrix C = {Cwd)w=i...nw,d=^---nD'^ notation is used to refer to the word 
count vector of a specific document d. The multinomial mixture model is such that: 

P(Q|a, (3)=Y,^t ^nj' , n ^""^t" (1) 

t=\ llu;=l "-^w''^- ^=1 

Note that the document length itself (denoted by l^j) is taken as an exogenous variable and 
its distribution is not accounted for in the model. The notations a = (ai, • • • i 
and /3i = {Pu, P2t, ■ ■ ■ , Pnwt) {^^^ t = 1; • • • > ^t) are used to refer to the model parameters, 
respectively, the mixture weights and the collection of theme-specific word probabilities. 

Adopting a Bayesian approach, we set independent noninformative Dirichlet priors on a 
(with hyperparameter Aq > 0) and on the columns f3t (with hyperparameter A/j > 0). The 
choice of the Dirichlet distribution in this context is natural because it is the conjugated dis- 
tribution associated to the multinomial, a property which will be instrumental in Section [2.31 

Therefore we get the following probabilistic generative mechanism for the whole corpus 
C = (Ci . . . Cnjy): 

1. sample a from a Dirichlet distribution with probabilities Aq, . . . , Aq 

2. for every theme t = 1, . . . , ut, sample f3t from a Dirichlet distribution with probabilities 

A/3, . . . , A/3 

3. for every document d = 1, . . . , 

(a) sample a theme Td in {1, ... , ut} with probabilities a = {ai,a2, ■ ■ ■ , an^) 

(b) sample Id words from a multinomial distribution with theme-specific probability 
vector Pxa- 

As all documents are assumed to be independent, the corpus likelihood is given by 

no 

PiC\a,(3) = llF{Cd\a,P) 

d=l 

Now, as the prior distributions are Dirichlet, the posterior distribution is proportional to 
(disregarding terms that do not depend on a or /?): 

p{a,/3\C) oc F{C\a,(3)p{a)p{(3) 

(no nx nw \ it nx nw 

n n n«'""' n n f^tr (2) 
d=l i=l w=l / t=l t=lw=l 

Maximizing this expression is in general intractable. 

We first consider the simpler case of supervised inference in which the themes T = 
(Ti, . . . ,Tn^) associated with the documents are observed. In this situation, inference is 
based on p{a, (3\C, T) rather than p{a, (3\C). In Section [2.21 we briefly recall that maximizing 
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p{a, (3\C, T) with respect to a and P yields the so-called naive Bayes classifier (with Laplacian 
smoothing) . In section 12.31 we tm'n to the so-called fully Bayesian inference which consists 
in integrating with respect to the posterior distribution p{a, f3\C,T). This second approach 
yields an alternative classification rule for unlabeled documents which is connected to the 
Dirichlet-Multinomial (or Polya) distribution. Both of these approaches have counterparts 
in the context of unsupervised inference which will be developed in Sections 14.21 and 14.51 
respectively. 

2.2 Naive Bayes Classifier 

When T is observed, the log-posterior distribution of the parameters given both the documents 
C and their themes T has the simple form: 

logpia, (3\C,T) = ^ USt + Xa - l)logat + Y,iK-'t + - l)log(3^t] (3) 

t=l \ w=l ) 

up to terms that do not depend on the parameters, where St is the number of training 
documents in theme t and K^t is the number of occurrences of the word w in theme t. 

Taking into account the constraints = 1 and X]I^=i f^wt = 1 (for ^ ^ {)-■,■■■ ^nTY)^ 

the maximum a posteriori estimates have the familiar form: 

•St -I- Aq, — 1 " Kyjt + A/3 — 1 

Oit = \ 7T TT Pwt - 



UD + riTiXa - Kt + nw{Xi3 -1) 

were Kt = Y2vjZi ^wt is the total number of occurrences in theme t. 

In the following, we will denote quantities that pertain to a test corpus distinct from 
the training corpus C using the * superscript. Thus C* is the test corpus, a particular 
document in the test corpus, its length, etc. The Bayes decision rule for classifying an 
unlabeled test document, say C^, then consists in selecting the theme t which maximizes 



F{T^ = t\C*a,aJ) = dtl[p: 



nw 



w=l 



{Kt + nw{Xf3-l)yi 

The above formula corresponds to the so-called naive Baves classifier, using Laplacian smq oth- 
ing for word and theme probability estimates ( Lewid . 19981 : McCallum and Nieaml . 1998h . 



2.3 Fully Bayesian Classifier 

An interesting feature of this model is that it is also possible to integrate out the parame- 
ters a and (3 under their posterior distribution allowing to evaluate the Bayesian predictive 
distribution 

P(T,* = t\Cl C,T) = I F{T^ = t\Cl a, f3)p{a, I3\C, T)dad(5 (5) 

From a Bayesian perspective, this predictive distribution is preferable, for classifying the doc- 
ument C^, to the naive Bayes rule given in Tractability of the above integral stems from 
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the fact that p{a, (3\C, T) is a product of Dirichlet distr ibutions - see (1^ . Hence PfC^j T^ = t) 



2002 



follows a so called Dirichlet- Multinomial distribution ( MosimannL 19621 : Minkal . 

To see this, consider the joint distribution of the observations C, the latent variables T 
and the parameters a and (5: 



nw 



wt 



t=l 



w=l 



As the above quantity, viewed as a function of a and Pi, ... , is a product of unnormalized 
Dirichlet distributions, it is possible to integrate out a and /3 analytically. The result of the 
integration involves the normalization constants of the Dirichlet distributions, yielding: 



p(rjC7) 



oc 



oc 



tut -r A/j) 
uit + A/j)] 



(6) 



Now, if we single out the document of index d assuming that the document itself has 
been observed but that the theme is unknown, elementary manipulations yield: 



p(r^ = t\Cd, c_d, T.d) ^{St-i + Xa] 



n:':ir(i^-,'^ + A^) TiElZiiK^t + x,)] 



(7) 



where T-d is the vector of theme indicators for all documents but d, C-d denotes the corpus 
deprived from document d, and K~f is the quantity K~f = Yli{d'^d-Ti=t] ^wd- With suitable 
notation change, this is exactly the predictive distribution as defined in 

Note that, in contrast to the case of the joint posterior probabilities P(T\C) given in Q, 
the normalization constant in ^ is indeed computable as it only involves summation over 
the riT themes. As another practical implementation detail, note that the calculation of 
can be performed efficiently as the special function T (or rather its logarithm) is only ever 
evaluated at points of the form n + A/j or n + nwXjS, where n is an integer, and can thus be 
tabulated beforehand. 

This formula can readily be used as an alternative decision rule in a supervised classifica- 
tion setting. We compare this approach with the use of the naive Bayes classifier in Section |31 
below. Equation is also useful in the context of unsupervised clustering where it provides 
the basis for simulation-based inference procedures to be examined in Section f4. 51 



3 Supervised Inference 

From a Bayesian perspective, the discriminative rule is more principled than the "naive 
Bayes" strategy @ usually adopted in supervised text clustering. In this section, we experi- 
mentally compare these two approaches. 

In C-d is the set of documents whose label is known and Cd is a particular unlabeled 
document. To allow for easier comparison with the naive Bayes classification rule in @, we 
rather denote by C the training corpus, T the associated labels and the test (unlabeled) 
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document. With these notations, ((7| becomes 



p(r,-t|Q,c,T)oc(5, + A.) r[E-,(i^.. + c:, + A,)] 

Comparing with (jlj) we get, after simplification of the Gamma functions, 

' (^t + A.-l) "gf--;;^-;;?'' naive Bayes; 

(5t + A„ ) n^-iJIf°' (■^'"'+^'3+^) fully Bayesian approach. 

n,lo {Kt+nw^n+i) 

If we ignore the offset difference on the hyperparameters (due to the non coincidence of 
the mode and the expectation of the multinomial distribution), note that the two formulas 
are approximately equivalent if: 



c* —1 

(i) All counts are or 1, hence, nj=o' i^wt + A/j + simplifies to {Kyjt + A/j 



(ii) The length of the document is negligible with respect to Kt + nw^/s, therefore, 
Il^Co\Kt + nwX/s + {Kt + nw^/sY*^- 

To assess th e actual differe nce in performance, we selected 5,000 texts from the 2000 
Reuters Corpus ( Reuters| . 2000l ). from five well-defined categories (arts, sports, health, dis- 



asters, employment). In a pre-processing step, we discard non alphabetic characters such as 
punctuation, figures, dates and symbols. For the time being, all words found in the training 
data are taken into account. Words that only occur in the test corpus are simply ignored. 
All experiments are performed using ten-fold cross-validation (with 10 random splits of the 
corpus). To obtain comparable results, we set: 

Aq, — 1 = 1 and A^ — 1 = A in the Naive Bayes case 
Aq, = 1 and A/g = A in the other case ' 

and change the value of A. Figure ^ reports the evolution of the error rate for both classifi- 
cation rules as a function of A. 

Both approaches yield very comparable results. Naive Bayes seems to outperform the 
fully Bayesian approach for larger values of the smoothing parameter while the converse is 
true for smaller values. However, the performance is very similar, since the largest difference 
between the scores of both approaches is around 0.2%, corresponding to one text only in our 
test corpus composed by 500 documents for each fold. 

We also tested on other com mon text clas sification benchmarks such as 20-Newsgroups 
(La,ngl.ll99,5h and Spam Assassin ( Masonl . E)03 v and tried to change the number of documents 



or size of vocabulary and the difference never gets statistically significant. Given that the 
iiaive Baves classifier is known to perfo rm worse than state-of-the-art classification methods 
( Yang and Lii] . I1999I : ISebastianiL l2002l 'l. ;he fully Bayesian classifier does not seem to be 



promising for supervised text classification tasks. This is not surprising as the conditions (i) 
and (ii) discussed above are nearly satisfied in this context. 

We now turn to the unsupervised clustering case, where the fully Bayesian perspective 
will prove more useful. 



7 



6.5 



Naive Bayes 

Fully bayesian approach 




Smoothing parameter 
Figure 1: Error rate as a function of A. 

4 Unsupervised Inference 

When document labels are unknown, the multinomial mixture model may be used to create a 
probabilistic clustering rule. In this context, the performance of the method is more difficult 
to assess. We therefore start this section with a discussion of our evaluation protocol (Sec- 
tion For estimating the parameters of the model, we first consider the most widespread 
approach, which is based on the use of the Expectation- Maximization (EM) algorithm. It 
turns out that in the context of large scale text processing applications, this basic approach is 
plagued by an acute sensitivity to initialization conditions. We then consider alternative esti- 
mation procedures, based either on heuristic considerations aimed at reducing the variability 
of the EM estimates fSection l4.4() or on the use of various forms of Markov chain Monte Carlo 
simulations fSection 14. 5j) and show that these techniques can yield less variable estimates. 



4.1 Experimental Framework 

We will use the same fraction of the Reuters corpus as in Section|21 As will be discussed below, 
initialization of the EM algorithm does play a very important role in obtaining meaningful 
document clusters. To evaluate the performance of the model, one option is to look at the 
value of the log-likelihood at the end of the learning procedure. However, this quantity is 
only available on the training data and does not tell us anything about the generalization 
abilities of the model. A more meaningful measure, commonly used in text applications, is 
the perplexity. Its expression on the test data is: 
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riT 



nw 



d=l t=l w=l 

It quantifies the ability of the model to predict new documents. The normalization by the 
total number of word occurrences /* in the test corpus C* is conventional and used to allow 
comparison with simpler probabilistic models, such as the unigram model, which ignores the 
document level. For the sake of coherence, we will also compute perplexity, rather than log- 
likelihood, on the training data: their variations are in fact equivalent as they are identical 
up to the normalization constant and the exponential function. 

A second indicator, also computable on the training and test data, is obtained by com- 
paring the cooccurrences of documents between "equivalent" clusters in two clusterings. To 
do so, we must have a way to establish the best mapping between clusters in two different 
clusterings. Provided that the t wo clusteririgs have the s ame size, this can be done with 
the so-called Hungarian method ( KuhnL 19551 : Frankl 2004h . an algorithm for computing the 
best weighted matching in a bi-partite graph. The complexity of this algorithm is cubic in 
the number of clusters involved. Once a one-to-one mapping between clusters is established, 
the score we consider is the ratio of documents for which the two clust erings "agree ", that 



is, which lie into clusters that are mapped by the Hungarian method. ( Lanee et al. . 20041 ) 



describes in more detail how this method can be used to evaluate clustering algorithms. 

A limitation of the evaluation with the Hungarian method is that it is not suited to 
compare two soft clusterings with different number of classes and especially the cases where 
one class in clustering A is split into two classes in clustering B. There exist other information- 
based measures built, such as the Relative Information Gain, that do not suffer from this 
limitation but present other drawbacks, such as undesirable behaviors for distributions close 
to equiprobability. We do not consider those here, as cooccurrence s cores obtained with the 
Hungarian method are easier to interpret (see Eigouste et al. . 2005al . for results on the same 
database quantified in terms of mutual information). 



4.2 Expectation-Maximization algorithm 

In an unsupervised setting, the maximum a posteriori estimates are obtained by maximizing 
the posterior distribution given in The resulting maximization program is unfortunately 
not tractable. It is however possible to devise an iterative estimation procedure, based on the 
Expectation-Maximization (EM) algorithm. Denoting respectively by a' and /?' the current 
estimates of the parameters and by the latent (unobservable) theme of document d, it 
is straightforward to check that each iteration of the EM algorithm updates the parameters 
according to: 

F{Td = t\C;a',(3') = "tll^=i^»'* (g) 

riD 

ato^K-l + ^V{Td = t\C;a',0) (10) 

d=l 
riD 

I3u,t^\p-1 + Y, C^d P(Trf = t\C- a', 13') (11) 

d=l 
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where the normahzation factors are determined by the constraints: 

I i:T=i!^u,t = l fortin{l,...,nT}. 

In the remainder of this section, we present the results of a series of experiments based 
on the use of the EM algorithm. We first discuss issues related to the initialization strategy, 
before empirically studying the influence of the smoothing parameters. The main findings of 
these experiments is that the EM estimates are very unstable and vary greatly depending on 
the initial conditions: this outlines a limitation of the EM algorithm, i.e. its difficulty to cope 
with the very high number of local maxima in high dimensional spaces. In comparison, the 
influence of the smoothing parameter is moderate, and its tuning should not be considered a 
major issue. 

4.2.1 Initialization 

It is important to realize that the EM algorithm allows to go back and forth between the values 
of the parameters a and /3 and the values of the posterior probabilities P(Trf = t\C; a, (3) using 
formulas Q, (|10|) and (|llj) . Therefore, the EM algorithm can be initialized either from the 
E-step, providing initial values for the parameters, or from the M-step, providing initial values 
for the posterior probabilities (that is, roughly speaking, an initial soft clustering). There are 
various reasons to prefer the second solution: 

• Initializing (3 requires to come up with a reasonable value for a very large number of 
parameters. A random initialization scheme is out of the question, as it almost always 
yields very unrealistic values in the parameter space; an alternative would be to consider 
small deviations from the unigram word frequencies: it is however unclear how large 
these deviations should be. 

• Initializing on posterior probabilities can be done without any knowledge of the model: 
for instance, it can be performed without knowing the vocabulary size. Section f4. 41 will 
show why this is a desirable property. 

Consequently, in the rest of this article, we will only consider initialization schemes that are 
based on the posterior theme probabilities associated with each document. A good option 
is to make sure that, initially, all clusters significantly overlap. Our "Dirichlet" initialization 
consists in sampling, independently for each document, an initial (fictitious) configuration of 
posterior probabilities from an exchangeable Dirichlet distribution. In practice, we used the 
uniform distribution over the nT-dimensional probability simplex (Dirichlet with parameter 
1). As the EM iterations tend to amplify even the smaller discrepancies between the com- 
ponents, the variability of the final estimates was not significantly reduced when initializing 
from exchangeable Dirichlet distributions with lower variance (ie., higher parameter value). 

To get an idea about the best achievable performance, we also used the Reuters categories 
as initialization. We establish a one-to-one mapping between the mixture components and 
the Reuters categories, setting for each document the initial posterior probability in Q to 1 
for the corresponding theme. Figure 121 displays the corresponding perplexity on the training 
and test sets as a function of the number of iterations. Results are averaged over 10 folds 
and 30 initializations per fold and are represented with box- and- whisker curves: the boxes 
being drawn between the lower and upper quartiles, the whiskers extending down and up to 
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±1.5 times the interquartile range (the outhers, a couple of runs out of the 300, have been 
removed) . 

The variations are quite similar on both (training and test) datasets. The main differ- 
ence is that test perplexity scores are worse than training perplexity scores. This classical 
phenomenon is an instance of overfitting. Due to the way the indexing vocabulary is selected 
(discarding words that do not occur in the training data), this effect is not observed for the 
unigram model^. 

The most striking observation is that the gap between both initialization strategies is huge. 
With the Dirichlet initialization, we are able to predict the word distribution more accurately 
than with the unigram model but much worse than with the somewhat ideal initialization. 
This gap is also patent for the cooccurrence scores with a final ratio of 0.95 for the "Reuters 
categories" initialization and an average around 0.6 for the Dirichlet initialization on test 
data. 



Training data Test data 




EM iterations EM iterations 

Figure 2: Evolution of Perplexity on training and test data over the EM iterations. 

Given that the Dirichlet initialization involves random sampling, it is worth checking how 
the performance change from one run to another. We report in Figure |31 the values of training 
perplexity and test cooccurrence scores for various runs on the first fold^. As can be seen 

^For both models, the fit is better on the training set than on the test set, which should be reflected by 
an increase in perplexity from one dataset to the other. However, as we ignore those (rare) words which only 
appear in the test data, the average probability of the remaining words in this corpus is somewhat artificially 
increased. For the unigram model, which is less prone to overfitting, this effect is the strongest, yielding a 
quite unexpected overall improvement of perplexity from the training to the test set. 

^In the rest of this article, perplexity measurements are only performed on the training data, for test data, 
we use the cooccurrence score as our main evaluation measure. Depending on the readability of the results. 
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more clearly on this figure, the variability from one initialization to another is very high for 
both measures: for instance, the cooccurrence score varies from about 0.4 to more than 0.7. 
This variability is a symptom of the inability of the EM algorithm to avoid being trapped in 
one of the abundant local maxima which exist in the high-dimensional parameter space. 

Training data Test data 
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Figure 3: Evolution of Perplexity and Cooccurrence scores over the EM iterations for different 
Dirichlet initializations for the first fold. 

4.2.2 Influence of the smoothing parameter 

Figure m depicts the influence of the smoothing parameter — 1 in terms of perplexity and 
cooccurrence scores. We do not consider here the influence of — 1, which is, in our context, 
always negligible with respect to the sum over documents of the themes posterior probabilities. 
For the Reuters categories initialization, there is almost no difference in perplexity scores for 
small values of A/3 — 1 (i.e. when A/3 — 1 < 0.2). The performance degrades steadily for 
larger values, showing that some information is lost, probably in the set of rare words (since 
smoothing primarily concerns parameters corresponding to very few occurrences). Similarly, 
for the Dirichlet initialization, the variations in perplexity are moderate for smoothing values 
in the range 0.01 to 1, yet there is a more distinguishable optimum, around 0.2. Using some 
prior information about the fact that word probabilities should not get too small helps to fit 
the distribution of new data, even for words that are rarely (or even never) seen in association 
with a given theme. 

These observations are confirmed by the observation of the test cooccurrence scores. First, 
we either plot all runs, as in Figure |3 or a box-and- whisker curve, as in Figure]^ 
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Figure 4: Evolution of training perplexity and test cooccurrence scores over the smoothing 
parameter A/j — 1. 



except when using very large (5 or more) values of the smoothing parameters, which yields a 
serious drop in performance, the categorization accuracy is rather insensitive to the smoothing 
parameter for the Reuters categories initialization. Of more practical interest however is the 
behavior for the Dirichlet initialization: the variations in performance are again moderate, 
with however a higher optimum value around 0.75. A possible explanation of this observa- 
tion that more smoothing improves categorization capabilities (even if it slightly degrades 
distribution fit) is that the model is so coarse and the data so sparse that only quite frequent 
words are helpful in categorizing; the other words are essentially misleading, unless properly 
initialized. This suggests that removing rare words from the vocabulary could improve the 
classification accuracy. 

All in all, changing the values of — 1 and A^ — 1 does not make the most important 
differences in the results, as long as they remain within reasonable bounds. Thus, in the rest 
of this article, we set them respectively to and 0.1. 

4.3 EM and deterministic clustering 

A somewhat unexpected property of the multinomial mixture model is that a huge fraction 
of posterior probabilities (that a document belongs to a given theme) is in fact very close to 
or 1. Indeed, when starting from the Reuters categories, the proportion of texts classified 
in only one given theme (that is, with probability one, up to machine precision) is almost 
100%. As we start from the opposite point of "extreme fuzziness", this effect is not as strong 
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with the Dh'ichlet initiahzation. Nevertheless, after the fifth iteration, more than 90% of the 
documents are categorized with almost absolute certainty. This suggests that in the context 
of large-dimensional textual databases, the multinomial mixture model in fact behaves like a 
deterministic clustering algorithm. 

This intuition has been experimentally confirmed as follows, implementing a "hard" (de- 
terministic) clustering version of the EM algorithm, in which the E-step uses deterministic 
rather than probabilistic theme assignments. This algorithm can be seen as an instance of 
a i^'-means algorithm, where the similarity between a text d G {1, . . . ,niy} and theme (or 
cluster) t G {1, . . . , nr} is computed as: 

nw 



dist(d,t) = - ^ Cwd^og{Pu,t) + log at 



w=l 



Up to a co nstant term, wh i ch on lv depends on the document, the first term is the Bregman 
divergence (|Baneriee et all . hoO^ ) between a theme specific distribution and the document, 



viewed as an empirical probability distribution over words. This measure is computed for 
every document and every theme, and each document is assigned to the closest theme. The 
reestimation of the parameters P^t is still performed according to (lllj) . where the posterior 
"probabilities" are either or 1. The weight at simply becomes the proportion of documents 
in theme t and P^t the ratio of the number of occurrences of w in theme t over the total 
number of occurrences in documents in theme t. 



J2{d:Ta=t} ^wd 
w=l 2^{d:Ta=t} ^wd 

This algorithm was applied to the same dataset, with the same initialization procedures 
as above. At the end of each iteration, we compute the cooccurrence score between the 
probabilistic clustering produced by EM and the hard clustering produced by this version of 
K-means. 

• With the Reuters Categories initialization, the cooccurrence score between both clus- 
terings is 1 after one iteration. 

• With the Dirichlet initialization, the score between the soft and hard clustering quickly 
converges to 1 and is greater than 0.99 after five iterations. 

In both cases, the outputs of the probabilistic and hard methods become indiscernible after 
very few iterations. We believe that this behavior of EM can be partly explained by the 
large dimensionality of the space of documents^. This assumption has been verified with 
experiments on artificially simulated datasets, which are not reported here for reason of 
space. 

^The vocabulary contains more than 40,000 words. 
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4.4 Improving EM via dimensionality reduction 

In this section, we push further our intuition that removing rare words should improve the 
performance of the EM algorithm and should alleviate the variability phenomenons observed 
in the previous section. After studying the effect of dimensionality reduction, we propose a 
novel strategy based on iterative inference. 

4.4.1 Adjusting the Vocabulary Size 

Having decided to ignore part of the vocabulary, the next question is whether we should 
rather discard the rare words or the frequent words. In this section, we experimentally 
assess these strategies, by removing consecutively tens, hundreds and thousands of terms 
from the indexing vocabulary. The words that are discarded are simply removed from the 
count matrix^. Results presented in Figure El suggest that the performance of the model 
with the Dirichlet initialization can be substantially improved by keeping a limited number 
of frequent words (900 out of 40,000). 

M t M M H I M M M t M M 

Categories (900 words) 

,i li li'l 'i 'I 1 i I i i I i I i i i i i I - 

I I I I I I II ^ I ^ I I I I I . I I I 
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1 10 20 30 
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Figure 5: Evolution of test cooccurrence scores over the EM iterations with a vocabulary of 
size 900. 

When varying the size of the vocabulary, perplexity measurements are meaningless, as the 
reduction of dimensionality has an impact on perplexity which is hard to distinguish from 

''An alternative option, that we do not consider here, would be to replace all the words that do not appear 
in the vocabulary by a generic "out-of- vocabulary" token. The main reason for not using this trick is that 
this generic token tends to receive a non-negligible probability mass; as a consequence, documents containing 
several unknown words tend to look more similar than they really are. 
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the variations due to a possible better fit of the model. The test cooccurrence score, on the 
other hand, is meaningful even when with varying vocabulary sizes. Figure El plots the test 
cooccurrence scores at the end of the 30th EM iteration as a function of the vocabulary size. 
For the sake of readability, the scale of the x-axis is not regular but rather focuses on the 
interesting parts: the interval between 100 and 3, 000 words, which corresponds to keeping 
only the frequent words, and the region above 40, 000 (out of a complete vocabulary of 43, 320 
forms), which corresponds to keeping only the rare words. This choice is motivated by the 
well-known fact that most of the occurrences (and therefore most of the information) are due 
to the most frequent words: for instance, the 3, 320 most frequent words account for about 
75% of the total number of occurrences. 

The upper graph in Figure El shows that removing rare words always hurts when using 
the Reuters categories initialization. In contrast, with the Dirichlet initialization, considering 
a reduced vocabulary (between 300 and 3, 000 words) clearly improves the performance. The 
somewhat optimal size of the vocabulary, as far as this specific measure is concerned, seems 
to be around 1,000. Also importantly, the performance seems much more stable when using 
reduced versions of the vocabulary, an effect we did not manage to achieve by adjusting the 
smoothing parameter. We will come back to this in the next section. It suffices to say here 
that the best score obtained with the Dirichlet initialization is still far behind the performance 
attained with the Reuters categories initialization. This agrees with our previous observation 
that even the rarest word are informative, when properly initialized. 

Less surprisingly, on the lower portion of the graph, one can see that removing the frequent 
words almost always hurts the performance. It is only in the case of the Reuters categories 
initialization that the removal of the 100 most frequent words actually yields a slight improve- 
ment of performance. Then the score steadily decreases with the removal of frequent words. 
The score is almost 0.2 (random agreement) with 20, 000 rare words, which is not surprising, 
as, in this case, the vocabulary mainly contains words occurring only once (so-called hapax 
legomena) in the corpus, reducing texts to at most a dozen of terms. 

To sum-up, there are two important lessons to draw from these experiments: 

• Reducing the dimensionality (vocabulary size) while the number of examples (size of 
the corpus) remains the same helps the inference procedure; 

• Using a reduced vocabulary allows to significantly reduce the variability of the parameter 
estimates. 

We now consider ways to use these remarks to improve our estimation procedure. 
4.4.2 Iterative estimation procedures 

In this section, we look for ways to reduce the variability of the clustering: our main goal 
here being that an end-user should get sufficiently reliable results without having to run the 
program several times and/or to worry about evaluation measures. 

Incremental vocabulary The first idea is to take advantage of our previous observations 
that reducing the dimension of the problem seems to make the EM algorithm less dependent 
on initial conditions. This suggests to obtain robust posterior probabilities using a reduced 
vocabulary, and to use them for initializing new rounds of EM iterations, with a larger vo- 
cabulary. Proceeding this way allows us to circumvent the problem of initializing the [3 
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Figure 6: Evolution of test cooccurrence scores over the size of vocabulary with two different 
strategies: discarding most rare words or discarding most frequent words. 



parameters corresponding to rare words, as we start from the other step of the algorithm 
(the M-step). When the vocabulary size is increased, the probabilities associated with new 
words are implicitly initialized on their average count in the corpus, weighted by the current 
posterior probabilities. This iterative procedure has the net effect of decomposing the inference 
process into several steps, each being sufficiently stable to yield estimates having both a small 
variance and good generalization performance. 

Figure [7| displays the results of the following set of experiments: we perform 15 EM 
iterations with a reduced vocabulary, save the values of posterior probabilities at the end of 
the 15th iteration, and use these values to initialize another round of 15 EM iterations, using 
the full vocabulary. Our earlier results obtained using the full vocabulary are also reported 
for comparison. The influence of the initial vocabulary size is important: as it is increased, 
the maximal score gets somewhat better but the results are more variable. 

These results can be improved by making the estimation process more gradual, thus 
reducing the variability of our estimates. Such experiments are reported in Figure IS] where we 
use four different steps. Proceeding this way, both the maximal and minimal scores lie within 
an acceptable range of performance. It is clear from these experiments that the choice of 
the successive sizes of vocabulary is particularly difficult, being a tight compromise between 
quality and stability. It remains to be seen how to devise a principled approach for finding 
such appropriate vocabulary increments. 
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Figure 7: Evolution of cooccurrence scores on test data with different vocabulary sizes. The 
first 15 iterations are conducted on reduced vocabularies (the size of which is reported on top 
the figures), while the last 15 are use the complete vocabulary. 

Multiple restarts Another usual approach in optimization problems where the large num- 
ber of local optima yields unstable results is to perform multiple restarts and pick up the best 
run according to some criterion. From this point of view, a sensible strategy is to choose the 
vocabulary size yielding the best maximum performance (for instance, Figure [7| suggests that 
starting with 800 words is a reasonable choice), run several trials and select the parameter 
set yielding the best cooccurrence score on the test data. For lack of this information (as 
would be the case in a real-life study, where no reference clustering is available), a legitimate 
question to ask is whether the training perplexity could be used instead as a reliable indicator 
of the quality of parameter settings. The answer is positive, as is shown in Figure |H1 

This figure reports results of the following experiments: after 15 EM iterations using 
a reduced vocabulary of 800 words, we consider the complete vocabulary for another 15 
additional EM iterations. Training set perplexity is computed at the end of iteration 15 and 
at the end of iteration 30. These measurements are repeated 30 times for each of the 10 folds. 
The test cooccurrence scores (somewhat representing the quality of clustering) are plotted 
as a function of this training perplexity in Figure El There is a clear inverse correlation, 
especially in the area of the best runs (low perplexity values-large cooccurrence scores) we 
are interested in. Selecting the run with lowest perplexity yields acceptable performance in 
both cases and the correlation is even stronger on the full vocabulary (it is therefore worth 
performing 15 more iterations with all the words). 

In summary, we have presented in this section two inference strategies which significantly 



18 




Figure 8: Evolution of cooccurrence scores over the different steps of an iterative algorithm 
(30 runs on the same fold). 



improve over a basic implementation of the EM algorithm: 

• split the vocabulary in several bins (at least 4) based on frequency; run EM on the 
smallest set and iteratively add words and rerun EM. 

• discard rare words, run several rounds of EM iterations, keep the run yielding the best 
training perplexity. 

( Eigouste et al. . 2nn5bl ) reports experiments which show that these strategies can be com- 
bined, yielding improved estimates of the parameters. 



4.5 Gibbs sampling algorithm 

In this section, we experiment with an alternative inference method, Gibbs sampling. The 
first subsection presents the results obtained with the most "naive" Gibbs sampling algorithm, 
which is then compared with a Rao-Blackwellized version relying on the integrated formula 
introduced in ((TJ. 



4.5.1 Sampling from the EM formulas 

To apply Gibbs sampling, we first need to identify sets of variables whose values may be 
sampled from their joint conditional distribution given the other variables. In our case, the 
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Figure 9: Correlation between training perplexity and test cooccurrence scores. 



most straightforward way to achieve this is to use the EM update equations ©, (jlU() . ((TT|) . 
Hence, we may repeatedly: 

• sample a theme indicator in {1, . . . ,nT} for each document from a multinomial distri- 
bution whose parameter is given by the posterior probability that the document belongs 
to each of the themes; 

• sample values for a, [i which, conditionally upon the theme indicators, follow Dirichlet 
distributions; 

• compute new posterior probabilities according to 

Figure displays the evolution of the training perplexity and the test cooccurrence 
score for 200 runs of the Gibbs sampler (ran for 10,000 iterations on one fold), compared to 
the regular EM algorithm and the iterative inference method described in Section 14.41 The 
performance varies greatly from one run to another and, occasionally, large changes occur 
during a particular run. This behavior suggests that, in this context, the Gibbs sampler does 
not really attain its objective and gets trapped, like the EM algorithm, in local modes. Hence, 
one does not really sample from the actual posterior distribution but rather from the posterior 
restricted to a "small" subset of the space of latent variables and parameters. Results in terms 
of perplexity and cooccurrence scores are in the same ballpark as those obtained with the 
EM algorithm, several levels below the ones obtained with the ad-hoc inference method of 
Section 
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Figure 10: Evolution of perplexity and cooccurrence scores over the EM-Gibbs Sampling 
iterations. 



4.5.2 Rao-Blackwellized Gibbs Sampling 

There is actually no need to simulate the parameters a. and as they can be integrated out 
when considering the conditional distribution of a single theme given in Q • We then obtain an 
estimate of the distribution of the themes T of all documents by applying the Gibbs sampling 
algorithm to simulate, in turn, every latent theme T^, conditioned on the theme assignment 
of all other documents. This strategy, which aims at reducing the number of dimensions of 
the sampling space, i s known as Rao-Blackwellized sampling, and often produces good results 
( Robert and Gasellal . Il99fll 'l . *^ote that if the document d is one word long, th i s app roach is 



identical to the Gibbs sampling algorithm described in (jGriffiths and Stevver4 E3 for the 
LDA model (using the identity T{a + 1) = aT{a)). 

Figure displays the training perplexity and the test cooccurrence scores for 30 inde- 
pendent random initializations of the Gibbs sampler, compared to the same references as in 
the previous section. We plot results obtained on 200 samples, each corresponding to 10,000 
complete cycles on one fold. The Gibbs sampler outperforms the basic EM algorithm for 
almost all runs. Its performance is in the same range as the iterative method, albeit much 
more variable (the cooccurrence score lies in the range 70% to 95%). The sampler trajectories 
also suggest that the Gibbs sampler is not really irreducible in this context and only explores 
local modes. 

This alternative implementation of the Gibbs sampling procedure is obviously much bet- 
ter than our first, arguably more naive, attempt: not only does it yield consistently better 
performance, but it is also much faster. Thanks to the tabulation of the Gamma function. 
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Figure 11: Evolution of the different measures for Rao-Blackwellized Gibbs Sampling 



the deterministic computations needed for both versions of the sampler are comparable. But 
the Gibbs sampler based on the EM formulas requires generating nx + 1 Dirichlet samples 
(with respective dimensions nw and ut) for a rough total of approximately nwnT Gamma 
distributed variables for the M-step, and nn samples from nr-dimensional discrete distribu- 
tions for the E-step. In comparison, the Rao-Blackwellized Gibbs sampling only requires 
n^-dimensional samples from discrete distributions. The difference is significant: our C-coded 
implementation of the latter algorithm runs 20 times faster than the vanilla Gibbs sampler. 



5 Conclusion 

In this article, we have presented several methods for estimating the parameters of the multi- 
nomial mixture model for text clustering. A systematic evaluation framework based on various 
measures allowed us to understand the discrepancy between the performance typically ob- 
tained with a single run of the EM algorithm and the best scores we could possibly attain 
when initializing on a somewhat ideal clustering. 

Based on the intuition that the high dimensionality incurred by a "bag-of-word" represen- 
tation of texts is directly responsible for this undesirable behavior of the EM algorithm, we 
have analyzed the benefits of reducing the size of the vocabulary and suggested a heuristic in- 
ference method which yields a significant improvement in comparison to the basic application 
of the EM algorithm. 

We have also investigated the use of Gibbs sampling, and proposed two different ap- 
proaches. The Rao-Blackwellized version, which takes advantage of analytic marginalization 
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formulas clearly outperforms the other, more straightforward, implementation. Performance 
obtained with Gibbs sampling are close to the ones obtained with the iterative inference 
method, albeit more dependent on initial conditions. 

Altogether, these results clearly highlight the too often overlooked fact that the inference 
of probabilistic models in high-dimensional spaces, as is typically required for text mining 
applications, is prone to an extreme variability of estimators. 

This work is currently extended in several directions. Further investigations of the multi- 
nomial mixture model are certainly required, notably aiming at (i) analyzing i ts behavior 



when used with very large numbers (several hundreds) number of themes, as in (iBlei et al 
2003); (ii) investigating model selection procedures to see how they can help discover the 



proper number of themes; (iii) reducing the overall complexity of the training: both the EM- 
based and the Gibbs sampling algorithm require to iterate over each document, an unrealistic 
requirement for very large databases. 

Another promising line of research is to consider alternative models: the multinomial 
mixture model can be improved in multiple ways: (i) its modeling of the count matrix is 
unsatisfactory , especiallv as it does no t take in a ccount typical effects of word occurrence 
distributions ( Church and Gale. . 1995: Kat d . Il996l l: this suggests to consider alternative, al- 
beit more complex models of the counts; (ii) the one docu ment-one theme assumptio n is also 
restri ctive, pleading for alternative models such as LDA (|Blei et all . hoO'j ) or GAP (|CannvL 
2nr)4h : preliminary experiments with the former model however suggest that i t might be faced 



with the same type of variability issues as the multinomial mixture model ((Rigouste et al 
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